Exploiting Sparsity in Hyperspectral Image
Classification via Graphical Models
Vishal Monga, Senior Member, IEEE, Nasser M. Nasrabadi, Fellow, IEEE, and Trac D. Tran, Senior Member, IEEE


Abstract — A  significant  recent  advance  in  hyperspectral  image 
(HSI)  classification  relies  on  the  observation  that  the  spectral 
signature of a pixel can be represented by a sparse linear combination
of training spectra from an overcomplete dictionary. A
spatiospectral  notion  of  sparsity  is  further  captured  by  developing 
a  joint  sparsity  model,  wherein  spectral  signatures  of  pixels  in 
a local spatial neighborhood (of the pixel of interest) are constrained
to be represented by a common collection of training
spectra,  albeit  with  different  weights.  A  challenging  open  problem 
is  to  effectively  capture  the  class  conditional  correlations  between 
these  multiple  sparse  representations  corresponding  to  different 
pixels  in  the  spatial  neighborhood.  We  propose  a  probabilistic 
graphical  model  framework  to  explicitly  mine  the  conditional 
dependences  between  these  distinct  sparse  features.  Our  graphical 
models  are  synthesized  using  simple  tree  structures  which  can  be 
discriminatively  learnt  (even  with  limited  training  samples)  for 
classification.  Experiments  on  benchmark  HSI  data  sets  reveal 
significant  improvements  over  existing  approaches  in  classification 
rates  as  well  as  robustness  to  choice  of  training. 

Index Terms — Classification, hyperspectral imagery, joint sparsity
model, probabilistic graphical models, sparse representation,
spatial  correlation. 

I.  Introduction 

HYPERSPECTRAL imaging sensors acquire digital images
in hundreds of continuous narrow spectral bands
spanning the visible-to-infrared spectrum [1]. A pixel in hyperspectral
images (HSIs) is typically a high-dimensional vector
of  intensities  as  a  function  of  wavelength.  The  high  spectral 
resolution  of  the  HSI  pixels  facilitates  superior  discrimination 
of  object  types. 

In HSI classification, the class label of each pixel is determined given a representative training set from each class. The support vector machine (SVM) [2], which solves binary classification problems by finding the optimal separating hyperplane between the two classes, has proved to be a powerful classifier for HSI classification tasks [3]. Variants such as SVM with composite kernels, which incorporate spatial information directly in the kernels [4], have led to improved performance.

Recent  work  has  highlighted  the  relevance  of  incorporating 
contextual  information  during  HSI  classification  to  improve 
performance  [4]-[7],  particularly  because  HSI  pixels  in  a  local 
neighborhood  generally  correspond  to  the  same  material  and 
have similar spectral characteristics. Many approaches have exploited
this aspect, for example, by including postprocessing of
individually  labeled  samples  [5],  [6]  and  Markov  random  fields 
in  Bayesian  approaches  [7].  The  composite  kernel  approach  [4] 
combines  the  spectral  and  spatial  information  from  each  HSI 
pixel  via  kernel  composition. 

An important recent advance exploits sparsity for HSI classification
[8], using the observation that spectral signatures of
the  same  material  lie  in  a  subspace  of  reduced  dimensionality 
compared  to  the  number  of  spectral  bands.  An  unknown  pixel  is 
then  expressed  as  a  sparse  linear  combination  of  a  few  training 
samples  from  a  given  dictionary,  and  the  underlying  sparse 
representation vector encodes the class information. Furthermore,
to exploit spatial correlation, a joint sparsity model is
employed  in  [8],  wherein  neighboring  pixels  are  assumed  to  be 
represented  by  linear  combinations  of  a  few  common  training 
samples  to  enforce  smoothness  across  these  pixels. 

The technique in [8] performs classification by using (spectral) reconstruction error computed over the pixel neighborhood. Recent work [9] in model-based compressed sensing has shown the benefits of using probabilistic graphical models as priors on sparse coefficients for signal (e.g., image) reconstruction problems. Inspired by this, we propose to use probabilistic graphical models to enforce a class-specific structure on sparse coefficients, wherein our designed graphs represent class conditional densities. We claim that the distinct sparse representations (corresponding to each pixel in a spatial neighborhood) resulting from the joint sparsity model [8] offer complementary yet correlated information for classification. Our proposed framework then exploits these class conditional correlations to build a powerful classifier. Specifically, a pair of discriminative tree graphs [10] is first learnt for each distinct set of features, i.e., the sparse representation vectors of each pixel in the local spatial neighborhood of a central pixel. These initially disjoint graphs are then thickened (by introducing new edges) into a richer graphical structure via boosting [10]-[12]. The training phase of our graphical model learning uses sparse coefficients from all HSI classes, and therefore, we learn a discriminative graph-based classifier that captures interclass information which is ignored by the reconstruction residual in [8]. Evaluation on benchmark HSI data sets reveals that exploiting the structure on sparse coefficients via class conditional graphs offers significant improvements in classification rates. Crucially, our technique exhibits a more graceful degradation with a decrease in the number of training HSI pixels, over state-of-the-art alternatives.

IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, VOL. 10, NO. 3, MAY 2013
1545-598X/$31.00 © 2012 IEEE

II.  Background 


A. Sparsity Model for Hyperspectral Classification

The HSI sparsity model is an extension of the sparse-representation-based framework first introduced for face recognition [13]. This model relies on the key observation that the spectral signatures of pixels approximately lie in a low-dimensional subspace spanned by representative training pixels from the same class. Consequently, for a test pixel whose class identity is unknown, there exists a sparse representation in terms of training samples from all classes. Let $y \in \mathbb{R}^{B}$ be a pixel with $B$ indicating the number of spectral bands and $D_m \in \mathbb{R}^{B \times N_m}$, $m = 1, 2, \ldots, M$, be the subdictionary whose columns are the $N_m$ training samples from the $m$th class. The HSI pixel $y$ can then be written as

$$y = D_1\alpha_1 + \cdots + D_M\alpha_M = [D_1 \; \cdots \; D_M] \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_M \end{bmatrix} = D\alpha \tag{1}$$

where $D \in \mathbb{R}^{B \times N}$ with $N = \sum_{m=1}^{M} N_m$ is a structured dictionary consisting of training samples (referred to as atoms) from all classes and $\alpha \in \mathbb{R}^{N}$ is a sparse vector. Given the overcomplete dictionary $D$, the sparse coefficient vector $\alpha$ is obtained by solving the following optimization problem:

$$\hat{\alpha} = \arg\min \|\alpha\|_0 \quad \text{subject to} \quad \|y - D\alpha\|_2 \le \epsilon \tag{2}$$

where $\epsilon$ is a suitably chosen reconstruction error tolerance. The sparse vector $\hat{\alpha}$ can be recovered efficiently using many norm minimization techniques, including greedy algorithms or $\ell_1$-norm relaxation [14]. The class label of $y$ is finally determined by the minimal residual between $y$ and its approximation from each class subdictionary

$$\text{Class}(y) = \arg\min_{m = 1, \ldots, M} \|y - D_m\hat{\alpha}_m\|_2 \tag{3}$$

where $\hat{\alpha}_m$ is the collection of coefficients in $\hat{\alpha}$ corresponding to the $m$th-class subdictionary.

B. Joint Sparsity Model

HSIs are usually smooth in the sense that pixels within a small neighborhood usually represent the same material, and thus, their spectral characteristics are highly correlated. In order to incorporate this spatial correlation information, the joint sparsity model [15] is employed for HSI classification in [8] by assuming that the sparse vectors associated with pixels in a local spatial neighborhood share a common sparsity pattern. Specifically, let $\{y_t\}_{t=1,\ldots,T}$ be $T$ pixels in a spatial neighborhood centered at $y_1$. These neighboring pixels can be expressed as

$$Y = [y_1 \; y_2 \; \cdots \; y_T] = [D\alpha_1 \; D\alpha_2 \; \cdots \; D\alpha_T] = D\underbrace{[\alpha_1 \; \alpha_2 \; \cdots \; \alpha_T]}_{S} = DS. \tag{4}$$

The sparse vectors $\{\alpha_t\}_{t=1,\ldots,T}$ share the same support, i.e., they are linear combinations of the same collection of atoms from $D$ but with possibly different weights assigned to each atom. As a result, $S$ is a sparse matrix with only a few nonzero rows. This row-sparse matrix $S$ can be recovered by solving the following constrained optimization problem:

$$\hat{S} = \arg\min \|Y - DS\|_F \quad \text{subject to} \quad \|S\|_{\text{row},0} \le K_0 \tag{5}$$

where $\|S\|_{\text{row},0}$ denotes the number of nonzero rows of $S$ and $\|\cdot\|_F$ is the Frobenius norm. The problem in (5) can be approximately solved by the greedy simultaneous orthogonal matching pursuit (SOMP) algorithm [15]. The identity of $y_1$ is then determined by the minimal total residual

$$\text{Class}(y_1) = \arg\min_{m = 1, \ldots, M} \|Y - D_m\hat{S}_m\|_F \tag{6}$$

where $\hat{S}_m$ contains the rows of $\hat{S}$ associated with $D_m$.
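The recovery in (5) and the residual rule in (6) can be illustrated with a short NumPy sketch. It is only a minimal greedy variant written for illustration, not the implementation used in [8] or [15]: the function names `somp` and `classify_by_residual`, the use of the row-wise Euclidean norm to score atoms, and the `labels` array mapping each atom to its class are all assumptions.

```python
import numpy as np

def somp(D, Y, K0):
    """Greedy sketch of SOMP: select K0 atoms of D shared by all columns of Y."""
    residual = Y.astype(float).copy()
    support = []
    coef = np.zeros((0, Y.shape[1]))
    for _ in range(K0):
        # Score each atom by its correlation with the residuals of all pixels
        # (assumption: row-wise Euclidean norm of D^T R).
        scores = np.linalg.norm(D.T @ residual, axis=1)
        if support:
            scores[support] = -np.inf  # never select the same atom twice
        support.append(int(np.argmax(scores)))
        # Least-squares fit of all pixels on the atoms selected so far.
        coef, *_ = np.linalg.lstsq(D[:, support], Y, rcond=None)
        residual = Y - D[:, support] @ coef
    S = np.zeros((D.shape[1], Y.shape[1]))
    S[support, :] = coef  # row-sparse coefficient matrix
    return S

def classify_by_residual(D, labels, Y, K0):
    """Assign the class whose subdictionary gives the smallest total residual, as in (6)."""
    S = somp(D, Y, K0)
    classes = np.unique(labels)
    residuals = [
        np.linalg.norm(Y - D[:, labels == m] @ S[labels == m, :])  # Frobenius norm
        for m in classes
    ]
    return classes[int(np.argmin(residuals))]
```

With a single column in `Y` (T = 1), the same sketch reduces to a greedy solver for the single-pixel problem in (2) followed by the residual rule in (3).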


C.  Probabilistic  Graphical  Models 

A graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is a collection of nodes $\mathcal{V} = \{v_1, \ldots, v_r\}$ and a set of (undirected) edges $\mathcal{E} \subset \binom{\mathcal{V}}{2}$. A probabilistic graphical model describes the joint distribution of a random vector with each node representing one (or a group of) random variable(s) whose conditional dependences are indicated by the presence of connecting edges. The graph structure leads to a factorization of the joint probability distribution of the random vector in terms of marginal and pairwise statistics. The Hughes phenomenon [16] highlights the difficulty of learning models for high-dimensional data with a limited number of training samples. The use of probabilistic graphs reduces sensitivity to the choice of training samples, particularly in the low-training regime [17, Ch. 8], [18].

Graphical models can be learnt either generatively or discriminatively. In the generative setting, a single graph which approximates a given distribution is learnt by minimizing the approximation error. The seminal contribution in this area is due to Chow and Liu [19], who obtained the optimal tree approximation $\hat{p}$ of a multivariate distribution $p$ by minimizing the Kullback-Leibler (KL) distance $D(p\|\hat{p}) = E_p[\log(p/\hat{p})]$ using first- and second-order statistics, via a maximum-weight spanning tree (MWST) problem. In discriminative learning, a pair of graphs is jointly learnt by minimizing the classification error. Recently, a discriminative learning framework has been proposed [10] by maximizing the tree-approximate J-divergence (a symmetric extension of the KL distance)

$$J(p, q) = \int \big(p(x) - q(x)\big) \log\frac{p(x)}{q(x)} \, dx. \tag{7}$$

Based on the observation that maximizing the J-divergence minimizes the upper bound on the probability of classification error, the discriminative learning problem then becomes

$$(\hat{p}, \hat{q}) = \arg\max_{\hat{p}, \hat{q} \text{ are trees}} J(\hat{p}, \hat{q}; \tilde{p}, \tilde{q}) \tag{8}$$

where $\tilde{p}$ and $\tilde{q}$ are the empirical estimates of $p$ and $q$, respectively. The problem in (8) is shown to decouple into two MWST problems [10]

$$\hat{p} = \arg\min_{\hat{p} \text{ is a tree}} D(\tilde{p}\|\hat{p}) - D(\tilde{q}\|\hat{p})$$

$$\hat{q} = \arg\min_{\hat{q} \text{ is a tree}} D(\tilde{q}\|\hat{q}) - D(\tilde{p}\|\hat{q}). \tag{9}$$

The optimal choice of $\hat{p}$ ($\hat{q}$) simultaneously minimizes its distance to $\tilde{p}$ ($\tilde{q}$) and maximizes its distance from $\tilde{q}$ ($\tilde{p}$).
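To make the MWST formulation concrete, the following sketch implements the generative Chow-Liu procedure [19] for discrete data: pairwise empirical mutual information serves as the edge weight, and a maximum-weight spanning tree is grown with Prim's algorithm. It does not implement the discriminative edge weights of [10] that solve (9); the function names and the restriction to discrete variables are assumptions made for illustration.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete sample vectors."""
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (xi, yi), 1.0)          # joint histogram
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)    # marginal of x, shape (a, 1)
    py = joint.sum(axis=0, keepdims=True)    # marginal of y, shape (1, b)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

def chow_liu_edges(X):
    """Edges of a maximum-weight spanning tree over the columns of X (Prim's algorithm)."""
    n = X.shape[1]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            W[i, j] = W[j, i] = mutual_information(X[:, i], X[:, j])
    in_tree = [0]
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j in in_tree:
                    continue
                if best is None or W[i, j] > W[best[0], best[1]]:
                    best = (i, j)
        edges.append(best)       # heaviest edge leaving the current tree
        in_tree.append(best[1])
    return edges
```

For a chain-structured source in which each variable is a noisy copy of the previous one, the returned edges follow the chain, since adjacent variables share the most mutual information.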

III.  Exploiting  Joint  Sparsity  via  Probabilistic 
Graphical  Models 

Learning  Thicker  Graphical  Models 

The solution to (9) learns tree-structured graphs $\hat{p}$ and $\hat{q}$, which, albeit learned optimally, can model only a small family of distributions due to their simple edge structure. However, optimally learning complex graphical models is, in general, NP-hard [20]. This problem is practically addressed by boosting simpler graphs [10]-[12] into richer structures. Recently, we have proposed a feature fusion framework [21] for image classification where the “initial graphical structure” for boosting is chosen as a forest of disjoint tree graphs. Thickening this forest with new edges is hence tantamount to discovering new conditional correlations between distinct feature sets.


Algorithm 1 LSGM (steps 1-4 offline)

1: Feature extraction (training): Compute sparse representations $\alpha_l$, $l = 1, \ldots, T$, for neighboring pixels of the training data.

2: Initial disjoint graphs: Discriminatively learn $T$ pairs of $N$-node tree graphs $\mathcal{G}_l^p$ and $\mathcal{G}_l^q$ on $\{\alpha_l\}$, for $l = 1, \ldots, T$, obtained from training data.

3: Separately concatenate nodes corresponding to the two classes, to generate initial graphs.

4: Boosting on disjoint graphs: Iteratively thicken initial disjoint graphs via boosting to obtain final graphs $\mathcal{G}^p$ and $\mathcal{G}^q$.

{Online process}

5: Feature extraction (test): Obtain sparse representations $\alpha_l$, $l = 1, \ldots, T$, in $\mathbb{R}^N$ from test image.

6: Inference: Classify based on the output of the resulting classifier using (10).


We  leverage  the  framework  in  [21]  for  HSI  classification. 
Here,  the  generation  of  distinct  and  complementary  feature 
sets  is  achieved  by  solving  the  joint  sparse  recovery  problem 
in  (5)  (with  distinct  feature  sets  being  the  sparse  coefficient 
vectors/columns  of  S).  This  is  shown  in  Fig.  1,  and  the  formal 
description  of  our  proposed  local  sparsity  graphical  model 
(LSGM)  algorithm  is  provided  in  Algorithm  1. 

Note that the proposed LSGM algorithm consists of an offline training stage (steps 1-4) and an online classification stage (steps 5 and 6). The local sparsity in the name is indicative of the joint sparsity model that is used to obtain the local sparse features. The discriminative graphs are learnt in the training stage. The process described here is for binary classification. The approach extends to multiclass problems by learning graphs in a one-against-all manner. For an $M$-class classification problem, we learn $M$ pairs of discriminative graphs that represent the class conditional probability density functions $f(\alpha|C_m)$ and $f(\alpha|\bar{C}_m)$ for $m = 1, 2, \ldots, M$, where $C_m$ denotes the $m$th class and $\bar{C}_m$ denotes the complement of $C_m$ (i.e., $\bar{C}_m = \bigcup_{k=1, k \neq m}^{M} C_k$).

We first obtain the feature vectors (i.e., sparse vectors with respect to a given training dictionary $D$) of training samples and their neighboring pixels by solving the joint sparse recovery problem in (5). Let $T$ be the size of the neighborhood. The extraction of sparse features may be viewed as a transformation $\mathcal{T}_l : \mathbb{R}^B \mapsto \mathbb{R}^N$, and there are $T$ such distinct transformations $\mathcal{T}_l$, $l = 1, 2, \ldots, T$. For every pixel $y \in \mathbb{R}^B$, $T$ different features $\alpha_l \in \mathbb{R}^N$, $l = 1, 2, \ldots, T$, are obtained, as shown in Fig. 1 for a 3 × 3 neighborhood with $T = 9$ (only three features are displayed). For each type of feature, training features for class $C_m$ correspond to pixels in a neighborhood of training samples known to belong to class $C_m$. Features for $\bar{C}_m$ are the sparse vectors associated with neighbors of its representative training samples.

Fig. 1. HSI classification using discriminative graphical models on sparse feature representations obtained from local pixel neighborhoods.
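As a small illustration of how the $T$ neighboring spectra might be gathered before solving (5), the sketch below stacks the pixels of a square window into a $B \times T$ matrix with the center pixel first. The `(rows, cols, bands)` cube layout, the function name `neighborhood_matrix`, and the clamping of coordinates at image borders are assumptions; the source does not specify how border pixels are handled.

```python
import numpy as np

def neighborhood_matrix(cube, row, col, half=1):
    """Stack the spectra in a (2*half+1) x (2*half+1) window as columns of Y.

    cube is assumed to have shape (rows, cols, bands). The center pixel is
    placed in the first column, matching the convention that the
    neighborhood is centered at y_1.
    """
    n_rows, n_cols, _ = cube.shape
    offsets = [(0, 0)] + [
        (dr, dc)
        for dr in range(-half, half + 1)
        for dc in range(-half, half + 1)
        if (dr, dc) != (0, 0)
    ]
    columns = []
    for dr, dc in offsets:
        # Assumption: coordinates outside the image are clamped to the border.
        r = min(max(row + dr, 0), n_rows - 1)
        c = min(max(col + dc, 0), n_cols - 1)
        columns.append(cube[r, c, :])
    return np.stack(columns, axis=1)  # shape (bands, T)
```

For a 3 × 3 window (`half=1`) this yields $T = 9$ columns; the $l$th feature $\alpha_l$ is then the $l$th column of the row-sparse matrix recovered from this `Y`.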

For each of the $T$ transformations $\mathcal{T}_l$, a pair of $N$-node discriminative tree graphs $\mathcal{G}_l^p$ and $\mathcal{G}_l^q$, which approximate the class distributions $f(\alpha_l|C_m)$ and $f(\alpha_l|\bar{C}_m)$, respectively, is simultaneously learnt by solving the decoupled MWST problems in (9). The initial disjoint graphs with $TN$ nodes representing the class distribution corresponding to $C_m$ and $\bar{C}_m$ are then generated by separately concatenating the nodes of $\mathcal{G}_l^p$, $l = 1, \ldots, T$, and $\mathcal{G}_l^q$, $l = 1, \ldots, T$, respectively. These graphs with sparse edge structure are then iteratively thickened via boosting [21]. Different pairs of discriminative graphs over the same sets of nodes with different weights are learnt in different iterations, and the newly learnt edges are used to augment the graphs. The final “thickened” graphs $\mathcal{G}^p$ and $\mathcal{G}^q$ are shown in Fig. 1 (right side).

The process described earlier (steps 1-4 in Algorithm 1) is performed offline, and $M$ pairs of discriminative graphs are learnt for the $M$ binary classification problems in a one-against-all manner. The classification of a new test sample is then performed online. Features $\alpha$ are extracted from the test sample $y$ by solving the sparse recovery problem in (5) for the $T$ pixels in the neighborhood centered at $y$. Let $f(\alpha|C_m)$ and $f(\alpha|\bar{C}_m)$ denote the final graphs learnt for $C_m$ and $\bar{C}_m$, respectively. The class label of $y$ is determined as follows:

$$\text{Class}(y) = \arg\max_{m \in \{1, \ldots, M\}} \log\left(\frac{f(\alpha|C_m)}{f(\alpha|\bar{C}_m)}\right). \tag{10}$$
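The decision rule in (10) amounts to evaluating $M$ log-likelihood ratios and taking the largest. The sketch below shows only that final step. The learnt graphical models are replaced by hypothetical stand-in densities (isotropic Gaussians created by `gaussian_logpdf`), which are purely illustrative and are not the boosted tree models of the paper; class indices are zero-based in the code.

```python
import numpy as np

def classify_one_vs_all(alpha, log_f_class, log_f_rest):
    """Return (argmax_m of the log-likelihood ratio, list of all ratios).

    log_f_class[m] and log_f_rest[m] are callables returning
    log f(alpha | C_m) and log f(alpha | complement of C_m).
    """
    llr = [lp(alpha) - lq(alpha) for lp, lq in zip(log_f_class, log_f_rest)]
    return int(np.argmax(llr)), llr

def gaussian_logpdf(mean, var):
    """Hypothetical stand-in density: log-pdf of an isotropic Gaussian."""
    mean = np.asarray(mean, dtype=float)

    def logpdf(x):
        x = np.asarray(x, dtype=float)
        quad = np.sum((x - mean) ** 2) / var
        return float(-0.5 * quad - 0.5 * x.size * np.log(2.0 * np.pi * var))

    return logpdf
```

Because each class is compared against its own complement model, the $M$ ratios are computed independently, which is what allows the graphs to be trained in a one-against-all manner.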


IV.  Experiments  and  Results 

We compare our proposed LSGM approach with three competitive methods: 1) spectral-feature-based SVM classifier [3], [22]; 2) composite kernel SVM (SVM-CK) [4]; and 3) joint sparsity model (SOMP) [8]. In SVM-CK, two types of kernels are used: a spectral kernel $K_\omega$ for the spectral (pixel) features (in $\mathbb{R}^{200}$) and a spatial kernel $K_s$ for the spatial features (in $\mathbb{R}^{400}$) which are formed by the mean and standard deviation of pixels in a neighborhood per spectral channel. A polynomial kernel (order $d = 7$) is used for the spectral features, while the radial basis function (RBF) kernel is used for the spatial features. The $\sigma$ parameter for the RBF kernel and the SVM regularization parameter $C$ are selected by cross-validation. The weighted summation kernel $K = \mu K_s + (1 - \mu)K_\omega$ effectively captures spectral and contextual spatial information, with the optimal choice $\mu = 0.4$ determined by cross-validation. A 5 × 5 window is used for the neighborhood kernels. The parameters for SOMP are chosen as described in [8]. The proposed LSGM approach uses a local window of dimension 3 × 3. For fairness of comparison, results for SOMP are also presented for the same window dimension.

We perform experiments using three distinct HSI data sets. Note that two flavors of the results are reported: 1) Tables I-III show the classification rates for carefully selected or good training samples which amount to about 10% of available data (typical of training choices in [4] and [8]), and 2) Fig. 3(a)-(f) shows the performance plotted as a function of training set size and also the results averaged from multiple (ten) random training runs. Fig. 3(d)-(f) characterizes the distribution of the classification rates (modeled as a random variable whose value emerges as an outcome of a given run and fit to a Gaussian distribution). Furthermore, we establish the statistical significance of our results by computing LSGM z-scores for each data set.

TABLE I
Classification Rates (%) for the AVIRIS Indian Pines Test Set. LSGM z-Score = -2.13

Class type         Training    Test     SVM   SVM-CK    SOMP    LSGM
Alfalfa                   6      48   83.33    95.83   87.50   89.58
Corn-notill             144    1290   88.06    96.82   94.80   95.50
Corn-min                 84     750   72.40    91.20   94.53   94.80
Corn                     24     210   60.48    87.62   93.33   94.76
Grass/pasture            50     447   92.39    93.74   89.71   90.82
Grass/trees              75     672   96.72    97.62   98.51   99.55
Pasture-mowed             3      23   47.82    73.91   91.30   91.30
Hay-windrowed            49     440   98.41    98.86   99.32   99.55
Oats                      2      18   50.00    55.56       0   44.44
Soybeans-notill          97     871   72.91    94.26   89.44   90.93
Soybeans-min            247    2221   85.14    94.73   97.03   97.39
Soybeans-clean           62     552   86.23    93.84   88.94   92.39
Wheat                    22     190   99.47    99.47     100     100
Woods                   130    1164   93.73    99.05   99.57   99.65
Building-trees           38     342   63.45    88.01   98.83   99.71
Stone-steel              10      85   87.05      100   97.65   98.82
Overall                1043    9323   85.11    95.15   95.31   96.18

TABLE II
Classification Rates (%) for the University of Pavia Test Set. LSGM z-Score = -2.01

Class type         Training    Test     SVM   SVM-CK    SOMP    LSGM
Asphalt                 548    6304   84.01    80.20   59.49   66.55
Meadows                 540   18146   67.50    84.99   78.31   86.10
Gravel                  392    1815   67.49    82.37   84.13   86.72
Trees                   524    2912   97.32    96.33   96.30   96.94
Metal sheets            265    1113   99.28    99.82   87.78   98.83
Bare soil               532    4572   92.65    93.35   77.45   94.62
Bitumen                 375     981   89.70    90.21   98.67   99.18
Bricks                  514    3364   92.24    92.95   89.00   94.44
Shadows                 231     795   96.73    95.85   91.70   96.10
Overall                3921   40002   79.24    87.33   78.75   86.38

TABLE III
Classification Rates (%) for the Center of Pavia Test Set. LSGM z-Score = -2.17

Class type         Training    Test     SVM   SVM-CK    SOMP    LSGM
Water                   745   64533   99.19    97.61   99.38   99.44
Trees                   785    5722   77.74    92.99   91.98   92.99
Meadow                  797    2094   86.72    97.37   95.89   96.99
Brick                   485    1667   40.37    79.60   86.44   87.28
Soil                    820    5729   97.52    98.65   96.75   97.64
Asphalt                 678    6847   94.77    94.37   93.79   94.54
Bitumen                 808    6479   74.37    97.53   95.06   96.99
Tile                    223    2899   98.94    99.86   99.83   99.90
Shadow                  195    1970     100    99.89   98.48   99.34
Overall                5536   97940   94.63    96.97   97.82   98.20

Fig. 2. Difference maps for the AVIRIS Indian Pines data set, relative to the ground truth map in (a). (b) SVM-CK [4]. (c) SOMP [8]. (d) Proposed LSGM approach.
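The weighted summation kernel used for the SVM-CK baseline, $K = \mu K_s + (1 - \mu)K_\omega$, can be sketched in NumPy as follows. The source fixes only the kernel types (polynomial of order 7 for spectral features, RBF for spatial features) and $\mu = 0.4$; the exact parameterizations used here, namely $(x^\top z + 1)^d$ and $\exp(-\|x - z\|^2 / 2\sigma^2)$, and the function names are assumptions.

```python
import numpy as np

def polynomial_kernel(A, B, degree=7):
    """Polynomial kernel (assumed form: (a.b + 1)^degree) between rows of A and B."""
    return (A @ B.T + 1.0) ** degree

def rbf_kernel(A, B, sigma):
    """RBF kernel (assumed form: exp(-||a - b||^2 / (2 sigma^2))) between rows of A and B."""
    sq = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def composite_kernel(X_spec, X_spat, Z_spec, Z_spat, mu=0.4, degree=7, sigma=1.0):
    """Weighted summation kernel K = mu * K_s + (1 - mu) * K_w.

    X_spec / Z_spec hold spectral features (one pixel per row);
    X_spat / Z_spat hold the matching spatial features (per-band mean and
    standard deviation over the neighborhood).
    """
    K_s = rbf_kernel(X_spat, Z_spat, sigma)
    K_w = polynomial_kernel(X_spec, Z_spec, degree)
    return mu * K_s + (1.0 - mu) * K_w
```

Since a nonnegative weighted sum of valid kernels is again a valid kernel, the resulting Gram matrix can be passed to any SVM solver that accepts precomputed kernels.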

A.  AVIRIS  Data  Set:  Indian  Pines 

The first HSI in our experiments is the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) Indian Pines image [23]. The AVIRIS sensor generates 220 bands across the spectral range from 0.2 to 2.4 μm, of which only 200 bands are retained after removing 20 water absorption bands [22]. This image has a spatial resolution of 20 m per pixel and a spatial dimension of 145 × 145. For well-chosen training samples, the difference maps obtained using the different approaches are shown in Fig. 2(b)-(d), and the classification rates for each class, as well as the overall accuracy, are shown in Table I. The improvement over SOMP indicates the benefits of using a discriminative classifier instead of reconstruction residuals for class assignment while still retaining the advantages of exploiting spatiospectral information.

Fig.  3(a)  shows  a  comparison  of  algorithm  performances  as  a 
function  of  training  set  size.  Our  LSGM  approach  outperforms 
the  competing  approaches,  and  the  difference  is  particularly 
significant  in  the  low-training  regime.  As  expected,  the  overall 
classification  accuracy  decreases  when  the  number  of  training 
samples is reduced. That said, LSGM offers a more graceful
degradation in comparison to other approaches. In Fig. 3(d),
the average classification rate is the highest for LSGM, which
is consistent with the results in Fig. 3(a). Furthermore, the
variance is the lowest for LSGM, underlining its improved
robustness to the particular choice of training samples.


B.  ROSIS  Urban  Data  Over  Pavia,  Italy 

The next two HSIs, University of Pavia and Center of Pavia, are urban images acquired by the Reflective Optics System Imaging Spectrometer (ROSIS). The ROSIS sensor generates 115 spectral bands ranging from 0.43 to 0.86 μm and has a spatial resolution of 1.3 m per pixel. The University of Pavia image consists of 610 × 340 pixels, each having 103 bands with the 12 noisiest bands removed. The Center of Pavia image consists of 1096 × 492 pixels, each having 102 spectral bands after 13 noisy bands are removed. For these two images, we repeat the experimental scenarios tested in Section IV-A.

The classification rates for the two ROSIS images are provided in Tables II and III, respectively, for the scenario of well-chosen training samples. In Table II, the SVM-CK technique performs marginally better than LSGM in the sense of overall classification accuracy. However, for most individual classes, LSGM does better, particularly in cases where the training sample size is smaller. In Table III, LSGM performs better than SOMP as well as SVM-CK. From Fig. 3(b) and (c), we observe that LSGM improves upon the performance of SOMP and SVM-CK by about 4%, while the improvements over the baseline SVM classifier are even more pronounced.

Fig. 3. Performance of different approaches as a function of the number of training samples provided (horizontal axis: percentage of training samples). (a) AVIRIS image. (b) University of Pavia image. (c) Center of Pavia image. (d)-(f) Density function of the classification rates obtained for ten different random realizations of training.

The z-score for LSGM on the AVIRIS image is -2.13, which indicates that, with a high probability (= 0.983), any random selection of training samples will give results similar to the values in Table I. For the University of Pavia and Center of Pavia images, the z-scores are -2.01 and -2.17, respectively. The negative sign merely indicates that the experimental value is less than the most likely value (Gaussian mean).
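The quoted probability can be checked numerically. One reading, assumed here and not spelled out in the text, is that 0.983 is the probability that a standard normal variable exceeds the z-score -2.13, i.e., the Gaussian mass lying above the reported experimental value. The helper name `prob_at_least` is introduced only for this sketch.

```python
import math

def prob_at_least(z):
    """P(Z >= z) for a standard normal random variable Z."""
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

# z-scores reported for the three data sets
z_scores = {
    "AVIRIS Indian Pines": -2.13,
    "University of Pavia": -2.01,
    "Center of Pavia": -2.17,
}

for name, z in z_scores.items():
    print(f"{name}: z = {z:+.2f}, P(Z >= z) = {prob_at_least(z):.3f}")
```

Under this reading, the value for the AVIRIS image agrees with the 0.983 quoted above to three decimal places.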

V.  Conclusion 

Linear reconstruction models that impose sparsity are gaining increasing popularity in HSI classification. A spatiospectral notion of sparsity is exploited in the current work by imposing a structure on sparse coefficients via discriminative graphical models. Results show marked improvement over powerful state-of-the-art classifiers, particularly in the form of robustness to choice and number of training hyperspectral profiles.